ABSTRACT

An item-selection algorithm to neutralize the differential
effects of time limits on scores on computerized adaptive tests is proposed.
The method is based on a statistical model for the response-time
distributions of the examinees on items in the pool that is updated each time
a new item has been administered. Predictions from the model are used as
constraints in a 0-1 linear programming (LP) model for constrained adaptive
testing that maximizes the accuracy of the ability estimator. The method is
demonstrated empirically using an item pool from the Armed Services
Vocational Aptitude Battery and the responses of 38,357 examinees. The
empirical example suggests that the algorithm is able to reduce the
speededness of the test for the examinees who otherwise would have suffered
from the time limit. Also, the algorithm did not seem to introduce any
differential effects on the statistical properties of the theta estimator.
(Contains 9 figures and 14 references.) (SLD)

Differential Speededness in Adaptive Testing
Abstract 

An item-selection algorithm to neutralize the differential effects of time limits on scores on 
computerized adaptive tests is proposed. The method is based on a statistical model for the 
response-time distributions of the examinees on the items in the pool that is updated each time 
a new item has been administered. Predictions from the model are used as constraints in a 0-1 
linear programming (LP) model for constrained adaptive testing that maximizes the accuracy 
of the ability estimator. The method is demonstrated empirically using an item pool from the 
Armed Services Vocational Aptitude Battery (ASVAB). 

Using Response-Time Constraints to Control for
Differential Speededness in Computerized Adaptive Testing 

Examinees with the same ability differ in the amount of time they need to complete a 
test item. As a consequence, some examinees may be affected unfavorably by the presence of 
a time limit and not reach the end of the test. Also, the question of how to score unreached 
items involves a series of difficult problems. First of all, in a conventional paper-and-pencil 
test it is impossible to discriminate between unreached items and reached items that were left 
unanswered because their answers were unknown. But even if it were exactly known which 
items are not reached (as is possible in computerized testing), scoring would remain a 
complicated issue. If the test is unspeeded and not too difficult, items not reached might be 
viewed as missing at random and ignored when the test is scored (for the notion of data 
missing at random, see Gelman, Carlin, Stern, & Rubin, 1995, sect. 7.4). However, if the test
is speeded, the examinees are faced with a speed-accuracy tradeoff and may choose different 
strategies of responding to the items. The test then needs to be scored under a model that 
explains such choices, but realistic models for doing so are not yet available. 

Empirical studies of the relation between response time and ability have become
possible through the introduction of computerized testing but are still hard to find. A
welcome exception is a recent study of the response times in a field test of a computerized
version of the National Board of Medical Examiners (NBME) Step 2 Licensure Exam
(Swanson, Featherman, Case, Luecht, & Nungester, 1997, April). In this study, responses to
items in linear subtests were timed, and no correlation between response time and ability was
found for the various subtests. However, a replication of the study for the NBME Step 1
Licensure Exam showed a moderate correlation towards the end of the subtests (Swanson,
personal communication, December 18, 1997). As the Step 1 Exam had a more stringent time
limit, the results seem to suggest that for a test with a mild or ineffective time limit, ability
and response time are uncorrelated but that a positive correlation is induced if the time limit is
tightened.

It seems important to control tests for speededness. Such control would not only make 
the assumption of a traditional (unidimensional) logistic item response theory (IRT) model 
more realistic but also prevent scoring problems due to unreached items. However, for a 
conventional linear test, the only two options seem to be to reduce the length of the test or 
increase the amount of time available. Given the variability in response times between 
examinees, the former would imply loss of accuracy in ability estimation for the faster 
examinees and the latter an increase in administration costs for all examinees. 

In computerized adaptive testing (CAT), an attractive solution to this test design 
dilemma is possible. If a model for the response-time distributions of the examinees on the 
items in the pool is available, the actual response times on the items administered to the 
current examinee can be recorded and used to update the estimates of these distributions for 
the remaining items in the pool. These distributions can then be used to constrain the selection 
of the next items in the test to give the test the same degree of speededness for all examinees. 
Of course, this procedure is only feasible if a model for the response-time distributions with a 
satisfactory fit to actual response-time data is available. 

It is the purpose of this paper to present an algorithm for adaptive testing that builds on 
this idea. In addition to the usual update of the estimate of the ability parameter in an IRT 
model, a lognormal model for the response-time distributions is used to update the response- 
time estimates for the items in the pool. Response-time constraints are derived from these 
estimates and imposed on the item selection using a 0-1 linear programming (LP) algorithm 
for constrained adaptive testing. The following sections of this paper introduce the model and 
the algorithm. The algorithm is then studied empirically using an item pool and estimates of 
response-time parameters for an adaptive version of a test from the Armed Services 
Vocational Aptitude Battery (ASVAB). The main purpose of the study was to ascertain the 
effects of the response-time constraints on the statistical properties of the ability estimator as 
well as the actual times needed to complete the CAT. 

Model for Response Times 

The response time of examinee j on item i is denoted by a variable $T_{ij}$. The variable is
assumed to be random because replications of tasks by the same subject are generally known
to show variation in the time needed to complete them (Luce, 1986, sect. 1.2; Townsend &
Ashby, 1983, chap. 3). The following decomposition for the (natural) logarithm of $T_{ij}$ is
assumed as a model for its distribution:

$$\ln T_{ij} = \mu + \delta_i + \tau_j + \varepsilon_{ij}, \tag{1}$$

with

$$\varepsilon_{ij} \sim N(0, \sigma^2), \tag{2}$$

where $\mu$ is the grand mean or general response-time level for the item pool and population of
examinees, $\tau_j$ is an effect parameter for the slowness of examinee j, $\delta_i$ is an effect parameter
for the amount of time demanded by item i, and $\varepsilon_{ij}$ is a normally distributed residual or
interaction term. Together, (1)-(2) imply a lognormal distribution for the observed response times
of a fixed examinee taking a fixed item. The model was proposed in Scrams and Schnipke (in
preparation). The effect terms $\tau_j$ and $\delta_i$ are defined to have expectations equal to zero across
examinees and items, respectively. The marginal distribution of log response time across
examinees for a fixed item also depends on the distribution of $\tau$. This distribution is examined
as one test of the goodness of fit of the model later in this paper.

Observe that the distributions in (1)-(2) vary in location across examinees and items
but have a common variance. The last assumption is stringent but allows us to estimate the
parameters in the model in a straightforward way. In addition, since the model will be used to
constrain item selection using only a percentile in the upper tail of the distributions towards
the end of the test, a slight misfit of the model would not seem to lead to serious item
selection errors. However, whenever using the model, it should be standard practice to check
its assumptions.

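For future reference, note that

$$\mu = E_{ij}(\ln T_{ij}), \tag{3}$$

$$\delta_i = E_j(\ln T_{ij}) - \mu, \tag{4}$$

$$\tau_j = E_i(\ln T_{ij}) - \mu, \tag{5}$$

$$\sigma^2 = E_{ij}\left[(\ln T_{ij} - \delta_i - \tau_j)^2\right] - \mu^2. \tag{6}$$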

Throughout this paper subscripts at expectation signs denote indices over which expectations 
are taken. 
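
As a minimal sketch of how the relations in (3)-(6) might be turned into estimates, the following Python function replaces each expectation by the corresponding average over a complete examinee-by-item matrix of response times. The function name, the data layout, and the assumption of a complete (balanced) matrix are illustrative choices and are not taken from the original study.

```python
import math


def estimate_rt_parameters(times):
    """Sample analogues of (3)-(6).

    times[j][i] is the response time (in seconds) of examinee j on item i.
    A complete matrix is assumed here purely for illustration.
    """
    logs = [[math.log(t) for t in row] for row in times]
    n_persons, n_items = len(logs), len(logs[0])

    # (3): grand mean of the log response times
    mu = sum(sum(row) for row in logs) / (n_persons * n_items)

    # (4): item effect = item mean over examinees minus grand mean
    delta = [
        sum(logs[j][i] for j in range(n_persons)) / n_persons - mu
        for i in range(n_items)
    ]

    # (5): examinee effect = examinee mean over items minus grand mean
    tau = [sum(row) / n_items - mu for row in logs]

    # (6): residual variance, computed from the centered residuals
    # ln t_ij - mu - delta_i - tau_j (their sample mean is zero)
    sigma2 = sum(
        (logs[j][i] - mu - delta[i] - tau[j]) ** 2
        for j in range(n_persons)
        for i in range(n_items)
    ) / (n_persons * n_items)

    return mu, delta, tau, sigma2
```

With real CAT data each examinee sees only a subset of the items, so the averages would have to be taken over the observed cells only.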

A different use of the lognormal distribution as a model for response times is made in 
Thissen's (1983) model for timed testing. In his model, the lognormal distribution is
parameterized to be dependent on the latent ability measured by the items and becomes part of 
the likelihood function used for estimating the examinees' ability from the joint distribution of 
the item scores and the response times. Thissen also found adequate fit for a series of tests, 
except for one which showed an overrepresentation of fast responses due to the relative 
easiness of its items. Other distributions used to study response times on test items are the 
Weibull (Roskam, 1997) and the gamma distribution (Verhelst, Verstralen & Jansen, 1997). 
In a previous study, the lognormal distribution showed a good fit to the response time 
distributions on an item pool from the Armed Services Vocational Aptitude Battery 
(ASVAB), outperforming the Weibull and gamma distributions (Schnipke & Scrams, 1997). 
These results will be further discussed below when an empirical example for the ASVAB is 
presented. 



IRT Model 

It is assumed that the item pool has been calibrated using an IRT model. In the 
empirical example later in this paper, the item pool was calibrated using the 3-parameter 
logistic (3-PL) model. The model describes the probability of a correct response on item i as: 

$$p_i(\theta) = \mathrm{Prob}\{U_i = 1 \mid \theta\} = c_i + (1 - c_i)\{1 + \exp[-a_i(\theta - b_i)]\}^{-1}, \tag{7}$$

where $\theta$ is the unknown ability of the examinee and $a_i \in [0, \infty)$, $b_i \in (-\infty, \infty)$, and $c_i \in [0, 1]$ are
the discrimination, difficulty, and guessing parameters for item i, respectively (Lord, 1980, chap. 2).

A key quantity in IRT is Fisher's information on the unknown ability parameter. For a 
test of n items, the measure is defined as: 

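$$I_{U_1,\ldots,U_n}(\theta) = -E\left(\frac{\partial^2}{\partial\theta^2} \ln L(\theta \mid U_1,\ldots,U_n)\right). \tag{8}$$

In (8), $L(\theta \mid U_1,\ldots,U_n)$ is the likelihood statistic associated with the (random) response vector
$U_1,\ldots,U_n$. For the 3-PL model in (7) it holds that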
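
$$I_{U_1,\ldots,U_n}(\theta) = \sum_{i=1}^{n} \frac{[p_i'(\theta)]^2}{p_i(\theta)[1 - p_i(\theta)]}, \tag{9}$$

with

$$p_i'(\theta) \equiv \frac{\partial}{\partial\theta}\, p_i(\theta) \tag{10}$$

(Lord, 1980, chap. 5). The term in (9) for item i will be denoted as $I_i(\theta)$ and is the item
information used in the maximum-information item selection criterion in the CAT algorithm
below.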
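
For illustration, item information under the 3-PL model can be computed from the response probability in (7) and its derivative with respect to the ability parameter. The sketch below assumes the logistic form exactly as written in (7), without an additional scaling constant; the function names are hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-PL model in (7)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a single 3-PL item at ability theta.

    Uses the squared derivative of the response probability divided by
    p * (1 - p), the standard form for a dichotomous item.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    p = c + (1.0 - c) * logistic
    dp = (1.0 - c) * a * logistic * (1.0 - logistic)  # derivative of p w.r.t. theta
    return dp ** 2 / (p * (1.0 - p))
```

A nonzero guessing parameter lowers the information an item provides at its difficulty level, which is why the 3-PL information is not simply proportional to the squared discrimination.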



It is assumed that the item pool consists of items indexed by i = 1,...,I. In addition, the
CAT is assumed to consist of items indexed by k = 1,...,n. Thus, index $i_k$ represents the event of
the ith item in the pool being administered as the kth item in the CAT. The index values of the
first k-1 items in the CAT are denoted by the set $S_{k-1} = \{i_1,\ldots,i_{k-1}\}$. The remaining items in the
pool are denoted by the set $R_k = \{1,\ldots,I\} \setminus S_{k-1}$. The kth item in the test is chosen from the set $R_k$.

The basic idea is to update an estimate of the examinee's slowness parameter $\tau_j$ in (1)-(2)
during the test given accurate estimates of $\mu$, the item parameters $\delta_i$, i = 1,...,I, and the residual
variance, $\sigma^2$. The improved estimates of $\tau_j$ are used to update projections of the time needed to
complete each of the remaining items in the pool. The next item is then selected subject to a
constraint based on these projections as well as the time available to complete the remaining
portion of the test. A Bayesian framework is used to update the response-time projections,
whereas the response-time constraints are incorporated in the item selection procedure using a
0-1 linear programming (LP) model for constrained CAT that maximizes the information on $\theta$ in
the test and also allows for additional constraints that can be used to guarantee its content
validity.



Response-Time Constraints in CAT 

Updating Response-Time Estimates

It is assumed that $\delta_i$, i = 1,...,I, and $\sigma^2$ have been estimated precisely enough to be
considered as known. Estimates can easily be obtained from the response times in the calibration
sample using the equations in (4) and (6). However, if examinee j is tested, $\tau_j$ is an unknown
parameter; it is assumed to have a normal prior distribution:

$$\tau_j \sim N(\mu_{0j}, \sigma_{0j}^2). \tag{11}$$

The model in (1)-(2) yields a normal likelihood with unknown mean and known
variance that has the family of normal distributions as its conjugate prior (Gelman, Carlin,
Stern, & Rubin, 1995, sect. 2.6). Hence, noting that $\ln t_{ij} - \mu - \delta_i = \tau_j + \varepsilon_{ij}$, the posterior
distribution of $\tau_j$, after the response times on items $i_1,\ldots,i_{k-1}$ have been recorded, is normal
with mean and variance:



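$$E(\tau_j \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) = \frac{\sigma^2 \mu_{0j} + \sigma_{0j}^2 \sum_{p=1}^{k-1} (\ln t_{i_p j} - \mu - \delta_{i_p})}{\sigma^2 + (k-1)\sigma_{0j}^2}, \tag{12}$$

$$\mathrm{Var}(\tau_j \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) = \frac{\sigma_{0j}^2 \sigma^2}{(k-1)\sigma_{0j}^2 + \sigma^2}. \tag{13}$$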
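
The conjugate update of the slowness parameter can be sketched in a few lines of Python. The function below is an illustrative implementation of the standard normal-normal update with known residual variance; its name, argument names, and default prior values are assumptions made for this example.

```python
import math


def posterior_tau(times, deltas, mu, sigma2, mu0=0.0, sigma2_0=1.0):
    """Posterior mean and variance of an examinee's slowness parameter.

    times    : recorded response times on the items administered so far
    deltas   : item effects of those same items (same order)
    mu       : grand mean of the log response times
    sigma2   : residual variance, treated as known
    mu0, sigma2_0 : mean and variance of the normal prior (hypothetical defaults)
    """
    n_obs = len(times)
    # Each ln(t) - mu - delta is an observation of tau with variance sigma2
    resid_sum = sum(math.log(t) - mu - d for t, d in zip(times, deltas))
    denom = sigma2 + n_obs * sigma2_0
    post_mean = (sigma2 * mu0 + sigma2_0 * resid_sum) / denom
    post_var = sigma2_0 * sigma2 / denom
    return post_mean, post_var
```

Before any item has been answered the function returns the prior itself; each additional response time pulls the mean towards the examinee's observed slowness and shrinks the variance.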



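Also, the predictive density for the log response time of examinee j on item i after items
$i_1,\ldots,i_{k-1}$ is normal, with mean equal to the posterior mean of $\tau_j$ shifted by $\mu + \delta_i$ and
variance equal to the sum of the prior and posterior variances, respectively:

$$E(\ln T_{ij} \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) = E(\tau_j \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) + \mu + \delta_i, \tag{14}$$

$$\mathrm{Var}(\ln T_{ij} \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) = \sigma_{0j}^2 + \mathrm{Var}(\tau_j \mid t_{i_1 j},\ldots,t_{i_{k-1} j}). \tag{15}$$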



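As the examinees are assumed to be exchangeable, an obvious choice for the
parameters in this prior is to equate them to the mean and variance of $\tau$ in the population of
examinees:

$$\mu_{0j} = E(\tau_j) = E_j[E_i(\ln T_{ij}) - \mu] = 0, \tag{16}$$

$$\sigma_{0j}^2 = \mathrm{Var}(\tau_j) \equiv \sigma_\tau^2, \tag{17}$$

for all j. Using (16)-(17), the mean and variance in (14)-(15) specialize to:

$$E(\ln T_{ij} \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) = \mu + \delta_i + \frac{\sum_{p=1}^{k-1} (\ln t_{i_p j} - \mu - \delta_{i_p})}{\sigma^2/\sigma_\tau^2 + (k-1)}, \tag{18}$$

$$\mathrm{Var}(\ln T_{ij} \mid t_{i_1 j},\ldots,t_{i_{k-1} j}) = \sigma_\tau^2 + \frac{\sigma_\tau^2}{1 + (k-1)\sigma_\tau^2/\sigma^2}. \tag{19}$$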



Because $\delta_i$, i = 1,...,I, and $\sigma^2$ are assumed to be estimated using (4) and (6), respectively, the
expressions in (18) and (19) have only known constants and are easy to calculate.

Let $t_{ij}^{\alpha}$ be the $\alpha$th percentile in this posterior predictive density for $\ln T_{ij}$
transformed back to the original time scale. The choice of item $i_k$ will now be constrained
using this percentile for all remaining items in the pool, $i \in R_k$. As will become clear later, it
makes sense to define $\alpha$ to be dependent on k and choose a percentile near the middle of the
density in the beginning of the test and move to percentiles in the upper tail towards the end of
the test.
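
Because the predictive density of the log response time is normal, a percentile on the original time scale is obtained by exponentiating the corresponding normal percentile. The following sketch assumes that the predictive mean and variance have already been computed; the function name and the convention of passing the percentile as a proportion between 0 and 1 are illustrative.

```python
import math
from statistics import NormalDist


def predicted_time_percentile(pred_mean, pred_var, alpha):
    """Percentile of a lognormal predictive distribution, in time units.

    pred_mean, pred_var : mean and variance of the predictive density of ln T
    alpha               : percentile expressed as a proportion, 0 < alpha < 1
    """
    z = NormalDist().inv_cdf(alpha)  # standard normal quantile
    return math.exp(pred_mean + z * math.sqrt(pred_var))
```

For alpha = 0.5 the result is the predicted median response time; larger values of alpha give progressively more conservative time projections.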

Constrained CAT Algorithm 

The kth item is selected according to an algorithm for constrained CAT presented in 
van der Linden and Reese (1998). To select the initial item, the algorithm first selects a full 
test that meets all constraints to be imposed on the selection of items in the CAT and has 
maximum information at the initial ability estimate. The item actually administered is the one 
from the assembled test with maximum information at this ability estimate. At each next step, 
the test is reassembled to have maximum information at the updated ability estimate fixing the 
items already administered. Again, the item to be administered is selected from the new 
portion of the test to have maximum information at the updated ability estimate. The 
procedure is repeated until the last item is selected. 

A full test, rather than a single item, is assembled at each step to keep
future item selection feasible with respect to the set of constraints. Because both the test
assembly and the selection of the individual item have the objective of maximum information,
the ability estimator can be expected to be maximally informative too. All test assembly is
done while the examinee takes the test and is based on a 0-1 LP model that represents all test
specifications.

To discuss the model, decision variables $x_i$, i = 1,...,I, are introduced that take the value
1 if item i is selected in the test and the value 0 otherwise. The total amount of time available
for the CAT is denoted as $t_{tot}$. In addition, it is assumed that the composition of the test is
constrained with respect to a variety of categorical attributes, such as content, cognitive level,
and item format. These attributes partition the item pool into a collection of sets $V_g$, g = 1,...,G,
each of which is defined by a (combination of) attribute value(s). Also, the composition of the
test can be constrained with respect to several quantitative attributes $a_{hi}$, h = 1,...,H, for example,
word counts, exposure rates, and IRT parameter values. Finally, let $\hat{\theta}_{k-1}$ be the estimate of $\theta$
after k-1 items have been administered.

The decision variables are used to formulate the following linear model for selecting 
item k for the current examinee: 

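$$\text{maximize} \quad \sum_{i=1}^{I} I_i(\hat{\theta}_{k-1})\, x_i \tag{20}$$

subject to

$$\sum_{i \in S_{k-1}} t_{ij}\, x_i + \sum_{i \in R_k} t_{ij}^{\alpha_k}\, x_i \le t_{tot}, \tag{21}$$

$$\sum_{i \in S_{k-1}} x_i = k - 1, \tag{22}$$

$$\sum_{i=1}^{I} x_i = n, \tag{23}$$

$$\sum_{i \in V_g} x_i \ge n_g^{(1)}, \quad g = 1,\ldots,G, \tag{24}$$

$$\sum_{i \in V_g} x_i \le n_g^{(2)}, \quad g = 1,\ldots,G, \tag{25}$$

$$\sum_{i=1}^{I} a_{hi}\, x_i \ge n_h^{(1)}, \quad h = 1,\ldots,H, \tag{26}$$

$$\sum_{i=1}^{I} a_{hi}\, x_i \le n_h^{(2)}, \quad h = 1,\ldots,H, \tag{27}$$

$$x_i \in \{0, 1\}, \quad i = 1,\ldots,I. \tag{28}$$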



The objective function in (20) maximizes the information in the test at $\hat{\theta}_{k-1}$. The total length of
the test is set at n items in (23), whereas (22) fixes the values of the decision variables of all
items that have already been administered to the examinee at 1. The key constraint in this paper
is the one on response times in (21), which requires the remaining n-k+1 items to be selected
such that the sum of the $\alpha_k$th percentiles of their predicted response-time distributions plus the
actual response times on the first k-1 items is not larger than the total amount of time available.
Note that these percentiles are now defined to be dependent on the rank of the item in the test. In
(24)-(25), the numbers of items with the various attribute categories are required to be between
lower and upper bounds $n_g^{(1)}$ and $n_g^{(2)}$, respectively. Finally, the constraints in (26)-(27)
guarantee that the sums of values for the various quantitative attributes are between the bounds
$n_h^{(1)}$ and $n_h^{(2)}$.

At step k, the model thus selects n-k+1 new items from the set $R_k$. The item actually
administered is the one selected from this set that is most informative at $\hat{\theta}_{k-1}$. The cycle is
then repeated to select item k+1.
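
The selection step can be illustrated with a toy version that keeps only the objective in (20) and the response-time and test-length constraints in (21)-(23). The sketch below solves the 0-1 problem by brute-force enumeration, which is practical only for very small pools; it is not the LP algorithm used in the study, and all names are hypothetical.

```python
from itertools import combinations


def select_next_item(info, pred_time, administered, actual_time, n, t_tot):
    """Pick the next item via a full-test ("shadow test") assembly.

    info[i]      : item information at the current ability estimate
    pred_time[i] : predicted response-time percentile for item i
    administered : indices of the items already given
    actual_time  : recorded response times on those items
    n, t_tot     : test length and total time available
    Returns the index of the next item, or None if no feasible test exists.
    """
    remaining = [i for i in range(len(info)) if i not in administered]
    free_slots = n - len(administered)
    time_used = sum(actual_time)

    best_combo, best_info = None, float("-inf")
    for combo in combinations(remaining, free_slots):
        # Time constraint: actual times so far plus predicted percentiles
        if time_used + sum(pred_time[i] for i in combo) > t_tot:
            continue
        # Administered items add the same constant to every candidate test,
        # so only the information of the free items needs to be compared.
        total = sum(info[i] for i in combo)
        if total > best_info:
            best_combo, best_info = combo, total

    if best_combo is None:
        return None
    # Administer the most informative of the newly selected items
    return max(best_combo, key=lambda i: info[i])
```

Tightening `t_tot` in this toy version makes the procedure trade a highly informative but time-consuming item for a quicker one, which is the behavior the time constraint is meant to produce.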



As already noted, it makes sense to choose $t_{ij}^{\alpha_k}$ close to the means of the posterior
predictive response-time densities in the beginning of the test but to move towards their upper
tails later on. This suggestion is motivated by the fact that the sum of the mean predicted
response times is a good predictor of the actual time for a large set of items but a more
conservative predictor is needed if the set becomes smaller.
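
One simple way to let the percentile depend on the position of the item is a linear schedule. The sketch below uses the settings reported for the empirical example in this paper (50th percentile for the first item, rising in equal steps to the 95th percentile at item 13 and held there for the remaining items); the function name and its parameterization are assumptions made for illustration.

```python
def percentile_schedule(n_items=15, start=50.0, end=95.0, last_step=13):
    """Percentile used for selecting item k = 1, ..., n_items.

    Moves from `start` at k = 1 to `end` at k = last_step in equal steps
    and keeps `end` for any later items.
    """
    step = (end - start) / (last_step - 1)
    return [min(start + step * (k - 1), end) for k in range(1, n_items + 1)]
```

Other schedules are conceivable; the only requirement suggested by the argument above is that the percentile is non-decreasing in k.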

Constraints beyond those in (20)-(28) can be added to the model to handle any remaining
test specifications. Several examples of possible additions are given in van der Linden (1998)
and van der Linden and Reese (1998). The set of specifications should be large enough to
guarantee a CAT with satisfactory content validity.

Empirical Example 

A previous pool from the Armed Services Vocational Aptitude Battery (ASVAB) was 
used to study the behavior of the algorithm. The CAT version of the ASVAB is extensively 
described in Sands, Waters and McBride (1997). The pool consisted of 186 items used for the 
Arithmetic Reasoning Test. The length of the test was 15 items. The items in the pool were 
calibrated using the 3-PL model in (7). A dominant feature of the pool was its high positive 
correlation between the estimated values of the difficulty and discrimination parameters 
displayed in Figure 1. This correlation explains an unexpected result reported below.

[Figure 1 about here] 

Response-time data were recorded for 38,357 examinees who took the test in 1997. The
parameters in the response-time model in (1)-(2) were estimated by substituting sample statistics
into (3)-(6). The following results were obtained: $\hat{\mu} = 4.093$, $\hat{\sigma} = .515$. In addition, the
estimated item and person effects, $\delta_i$ and $\tau_j$, were found to be distributed with a standard
deviation equal to .497 and .375, respectively. As shown in Figure 2, the values of the estimated
item effect parameter, $\delta_i$, increased strongly with the estimated difficulty of the items. This
result is as expected for an arithmetic reasoning test: the more difficult the item, the more time
needed to solve it. Also, the correlation between $\theta$ and $\tau$ was found to be equal to .035,
indicating that ability and speed were practically uncorrelated in the population of examinees.

[Figure 2 about here]

Note that the sample standard deviations of $\delta$ and $\tau$ reported here no doubt overestimate
their population parameters. However, because the sample of examinees is large, the bias in the
estimate of the standard deviation of $\delta$ will be negligible. Moreover, this estimate is only
reported here as a descriptive statistic; it does not play any role in the adaptive procedure in this
example. The bias in the estimate of $\sigma_\tau$ is expected to be larger, but its square serves as the
variance of the prior for $\tau_j$ in the adaptive procedure. The result is thus a less informative prior,
and hence a more conservative adaptive test.

Also, note that the matrix of the CAT-ASVAB response times used in this example 
was not necessarily balanced with respect to the person and item effects. It is hard to ascertain 
how this has affected the estimated item effects and the standard deviation of the examinee 
effects. However, serious effects would have led to a deteriorated fit of the lognormal model. 
Since the model fit was generally good (see below), confounding was not considered an
important issue. For an actual application of the lognormal model, it is recommended to 
collect response time data through a balanced design (e.g., during item pretesting) or to make 
the matrix balanced afterwards through resampling. The latter option was not feasible here 
because for each examinee response times on only 15 items were available. 

The main goal of the study was to explore the effects of the response-time constraints 
in (21) on the statistical properties of the ability estimator. In particular, the bias and root 
mean squared error (RMSE) functions of the estimator were studied relative to those for the 
unconstrained version of the CAT-ASVAB. In an earlier study of constrained adaptive testing 
for an item pool from the Law School Admission Test (LSAT) with a 0-1 LP model with 433
constraints representing its specifications, the bias and RMSE functions were slightly affected 
by the presence of the constraints for short test lengths but no effects remained after 30 items 
(van der Linden & Reese, 1998). 
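
For reference, the bias and RMSE at a given true parameter value can be estimated from replicated simulations as the mean error and the root of the mean squared error. The helper below is a minimal sketch; its name and interface are illustrative.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Empirical bias and root mean squared error of replicated estimates."""
    errors = [e - true_value for e in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return bias, rmse
```

Evaluating the helper separately at each simulated ability value yields the bias and RMSE functions discussed in the text.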

A second goal of the study was to assess the effect of the use of the response-time 
constraints on the examinees' time needed to complete the test. In the main simulation study, 
the time limit was set equal to the one used in the actual ASVAB test ($t_{tot}$ = 39 mins).
However, the limit was chosen to introduce only a mild form of speededness for the ASVAB
examinees (Segall, Moreno, Bloxom, & Hetter, 1997). To investigate the effects of tighter
time limits, this part of the simulation was therefore replicated for $t_{tot}$ = 34 and 29 mins.
Finally, to create a baseline for evaluating the effects of the response-time constraints, the 
times needed to complete the test for an unconstrained CAT version were also simulated. 

The ability values of the simulated examinees were chosen to be equal to $\theta$ = -2.0, -1.5,..., 2.0.
In addition, their response times were simulated at $\tau$ = -.60, -.30, .00, .30, and .60. Recall
that the values of $\tau$ in the sample of ASVAB examinees were estimated to be
distributed about .00 with a standard deviation equal to .515; the $\tau$ values were thus chosen to
cover the range of values in the ASVAB population. The number of replications for each
combination of the $\theta$ and $\tau$ values was equal to 180.

As already mentioned, all simulations were replicated without the response-time
constraints on the item selection. This condition provided a baseline for the evaluation of the
constrained CATs.

The ability estimator used was the expected value of the posterior distribution of the
ability parameter (EAP estimator), with a uniform prior distribution over $\theta \in [-4.0, 4.0]$. The
first item was selected to be optimal at $\theta = 0$. The values of the parameter $t_{ij}^{\alpha_k}$ in (21) were set
equal to the 50th percentile in the posterior predictive distributions for the selection of item k = 1
and moved to the 95th percentile for the selection of item k = 13 in equal steps. The last value
was maintained for items k = 14 and 15.
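
An EAP estimate with a uniform prior can be approximated by evaluating the likelihood on a grid over the prior's support and taking the likelihood-weighted mean of the grid points. The sketch below assumes the 3-PL form in (7) and an equally spaced grid on [-4, 4]; the grid size and the function name are arbitrary choices for illustration.

```python
import math


def eap_estimate(responses, items, lo=-4.0, hi=4.0, n_points=161):
    """Expected a posteriori ability estimate under a uniform prior.

    responses : list of 0/1 item scores
    items     : list of (a, b, c) 3-PL parameter tuples, same order
    """
    grid = [lo + (hi - lo) * q / (n_points - 1) for q in range(n_points)]
    weights = []
    for theta in grid:
        likelihood = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
            likelihood *= p if u == 1 else 1.0 - p
        weights.append(likelihood)  # uniform prior: posterior is proportional to likelihood
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total
```

With no responses the estimate equals the prior mean of zero, which matches the starting value used for selecting the first item.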

The LP model was solved using the First Acceptable Integer Solution Algorithm from 
the ConTEST test assembly software package; a detailed description of the algorithm is 
available in the manual (Timminga, van der Linden, & Schweizer, 1996, sect. 6.6). On a PC
with a 133 MHz Pentium processor, the times needed to update the ability estimate, solve the LP
model, and select the most informative item were always less than 1.5 secs.

Fit of Response-Time Distribution 

The response-time distribution in (1)-(2) was tested for its assumption of a lognormal shape
against the assumptions of a normal, gamma, and Weibull distribution. Detailed results for the
current data set are given in Schnipke and Scrams (1997). Only single observations were
recorded for each item-examinee combination. However, as shown by Figure 3, the
distribution of the values for the examinee slowness parameter, $\tau_j$, approximated a normal
distribution and therefore the marginal distribution of the log response times across the
examinees in the ASVAB population should also be approximately normal. This feature was
checked in depth for 30 of the 186 items. The items were selected to be answered by at least
1,000 examinees and not to involve any figures. The samples were randomly split into halves
used for estimating the parameters and checking the distributional assumption. The
parameters of the four candidate models for the response-time distributions were estimated
using the method of ML estimation.

[Figure 3 about here]

Double probability plots were produced for each model and each item, with the observed
cumulative probability function along the abscissa and the estimated function along the
ordinate. A typical example is given in Figure 4. The lognormal distribution provided the
best fit (indicated by most points falling along the diagonal), followed by the gamma,
Weibull, and normal. The same essential result was found for all other items.

[Figure 4 about here]

Fits were also examined using the RMSE calculated between the observed and estimated
distribution functions at each fifth percentile. The results for all 30 items are provided in
Figure 5. For readability, items were ranked according to the quality of fit provided by the
lognormal distribution. Again, the lognormal distribution provided the best overall fit, and the
other three distributions were ranked as before.

[Figure 5 about here]
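
A fit index of this kind might be computed as follows for the lognormal model. The sketch fits the model by ML on one half of the data, reads off every fifth percentile of the log response times in the other half, and compares the observed cumulative proportions (.05, .10, ..., .95) with the fitted distribution function at those points. This reading of the procedure, the function name, and the use of Python's default quantile method are assumptions.

```python
import math
from statistics import NormalDist, fmean, pstdev, quantiles


def lognormal_fit_rmse(fit_times, check_times):
    """RMSE between observed and fitted lognormal CDFs at every 5th percentile.

    fit_times   : response times used to estimate the parameters
    check_times : held-out response times used to check the fit
    """
    logs = [math.log(t) for t in fit_times]
    model = NormalDist(fmean(logs), pstdev(logs))  # ML estimates on the log scale

    check_logs = [math.log(t) for t in check_times]
    cut_points = quantiles(check_logs, n=20)       # 5th, 10th, ..., 95th percentiles
    observed = [k / 20 for k in range(1, 20)]      # observed cumulative proportions

    squared = [(model.cdf(x) - p) ** 2 for x, p in zip(cut_points, observed)]
    return math.sqrt(fmean(squared))
```

The same comparison could be repeated with gamma, Weibull, and normal fits to rank the candidate distributions for each item.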

The assumption of a common standard deviation across items was tested by plotting the
sample estimates of the standard deviations of the same marginal distributions of the log
response times for each of the 186 items against sample sizes in Figure 6. Different items
were administered different numbers of times. For smaller samples much of the
observed variability may be attributed to sampling variation. However, for larger samples, the
estimated standard deviations should stabilize about an identifiable mean. Figure 6 shows this
stabilization to hold indeed about the value of 0.63.

[Figure 6 about here]

Bias and RMSE Functions of Estimator of $\theta$

The bias and RMSE of the $\theta$ estimator in the CAT algorithm were estimated with results
displayed in Figure 7. Bias as a function of $\theta$ was roughly flat except for the upper part of the
scale, where a larger positive bias was obtained. This effect was observed both for the
constrained and unconstrained CAT versions. It is assumed to be entirely due to the
distribution of the values for the difficulty and discrimination parameters in the item pool. As
is evident from Figure 1, examinees with a $\theta$ value at the upper part of the scale will tend to get
items that are too easy with an extremely high value for their discrimination parameter. Most of
these items will yield correct responses and the ability estimates will tend to drift upward. Bias in
the estimator of $\theta$ as a function of $\tau$ was flat for all $\theta$ values, indicating no systematic impact of
the personal slowness parameter on the errors in the $\theta$ estimates. Observe that the heights of the
lines correspond with the average bias for the $\theta$ values in Figure 7a.

[Figure 7 about here]

The RMSE plot in Figure 7c shows the typical U-shaped form for a CAT with an
initial ability estimate at $\theta = 0$. The lack of symmetry must be the result of the squared bias
component in the mean squared error, which originates from the composition of the item pool
discussed above. Again, both the constrained and unconstrained CAT versions have this
asymmetry. Figure 7d reveals that the slowness of the examinees had no impact on the RMSE of the
$\theta$ estimator.

A comparison between the bias and RMSE for the constrained and unconstrained 
CATs revealed slightly unfavorable differences for the constrained function. The differences 
are assumed to be the result of the relatively short length of 15 items for the adaptive test. In 
the previous study with the LSAT, larger differences were found for a test length of 15 items; 
however, these differences disappeared completely as the test length approached 30 items (van 
der Linden & Reese, 1998). 

Bias and RMSE Functions of Estimator of $\tau$

Also, the bias and RMSE of the $\tau$ estimator were estimated as a function of $\theta$ and $\tau$. The results
are given in Figure 8. The plots show a small inward bias, typical of a Bayesian estimator,
and a uniform RMSE as a function of $\tau$ (Figures 8a and 8c). In addition, as expected, the bias
and RMSE appear to be independent of the true value of the $\theta$ parameter. As before, the heights
of the horizontal bias functions in Figure 8b correspond with the bias at the corresponding $\tau$
values in Figure 8a.

[Figure 8 about here]

Distributions of Time Left After Completion 

Because response times were sampled from the model in (1)-(2) for each item selected, it was 
possible to estimate for each simulated examinee how much time was needed to complete the 
test. Figure 9a shows the average time left after completion of the CAT version without 
response-time constraints as a function of the examinee slowness parameter, τ. The dotted line 
represents the time limit of 39 mins. (=2340 secs.) in the simulation; results below the line 
indicate extra time needed to finish the full test. For all θ values, the average time remaining at 
the end of the CAT is a decreasing function of the slowness of the examinee. The majority of the 
simulated examinees appeared to complete the test in time; those who were not able to do so 
were exclusively among the ones with high θ values. In a CAT, these examinees get the most 
difficult items, and the positive correlation between b_i and δ_i in Figure 2 shows that these items 
tend to demand more time. 

[Figure 9 about here] 
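The calculation of time left after completion can be sketched as follows. The sketch assumes, for illustration only, a lognormal response-time model with additive item and examinee terms, ln T_i = mu + delta_i + tau + e_i with e_i normal with mean 0 and standard deviation sigma. This section does not restate equations (1)-(2), so the exact model form, the function names, and all parameter values below are assumptions rather than the values used in the study.

```python
import math
import random

def time_left(t_tot, mu, deltas, tau, sigma, rng):
    """Sample response times for one examinee and return t_tot minus their sum.

    Assumed model (illustrative): ln T_i = mu + delta_i + tau + e_i,
    with e_i ~ N(0, sigma^2). A negative result means extra time is needed.
    """
    total = 0.0
    for delta in deltas:
        log_t = mu + delta + tau + rng.gauss(0.0, sigma)
        total += math.exp(log_t)
    return t_tot - total

def average_time_left(t_tot, mu, deltas, tau, sigma, n_rep=1000, seed=0):
    """Average time left over n_rep simulated administrations."""
    rng = random.Random(seed)
    draws = (time_left(t_tot, mu, deltas, tau, sigma, rng) for _ in range(n_rep))
    return sum(draws) / n_rep

# Hypothetical setup: 15 items of equal time intensity, 39-minute limit (2340 secs.)
deltas = [0.0] * 15
fast = average_time_left(2340.0, 4.0, deltas, tau=-0.5, sigma=0.5)
slow = average_time_left(2340.0, 4.0, deltas, tau=0.5, sigma=0.5)
print(f"fast examinee: {fast:.0f} secs. left; slow examinee: {slow:.0f} secs. left")
```

Under these assumptions the average time left decreases as the slowness parameter increases, which is the qualitative pattern described for Figure 9a.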

Figure 9b shows the effect of the response-time constraints. The curves for the 
examinees with the high θ values now run more horizontally for the larger τ values, and none of 
the curves intersect the line representing the time limit. As the time limit for the ASVAB was 
rather mild, the simulations with the response-time constraints were repeated for t_tot equal to 34 
mins. (=2040 secs.) and 29 mins. (=1740 secs.). As Figures 9c and 9d show, the general effect 
of a tighter time limit is less time left after completion, but all curves are still above the lines 
representing the limits. 

Conclusion 

The purpose of this paper was to present an item-selection algorithm for CAT to 
remove the differential effects of time limits on the performances of examinees. The empirical 
example suggests that the algorithm is able to reduce the speededness of the test for the 
examinees who otherwise would have suffered from the time limit. Also, the algorithm did 
not seem to introduce any differential effects on the statistical properties of the θ estimator. 
The differences in bias and RMSE in this estimator between the constrained and unconstrained 
CAT versions were uniform and independent of any other parameter in the simulation. Also, 
they were generally small and are expected to disappear for a longer test than the subtest from 
the ASVAB used in the example. 

The examinees in the empirical example who suffered from the time limit under the 
condition of an unconstrained CAT were thus exclusively among those with high θ values. 
This finding was a direct result of the fact that ability and speed were independent variables in 
the examinee population whereas item difficulty and the time demanded by the item correlated 
positively. As speed and ability were uncorrelated, some of the more able examinees were slow. 
Nevertheless, as a result of the adaptive nature of the test and the positive correlation between 
item difficulty and response time, they tended to get the items that demanded the most time. These 
examinees thus profited strongly from the presence of the response-time constraints in the 
proposed algorithm. 

The low impact of the time constraints on the estimator of θ suggests that the time limit 
used for the CAT-ASVAB was ample. The additional analyses with the 34- and 29-minute time 
limits suggest that the item pool may be rich enough to support even more stringent time limits. 
Ultimately, however, there is a tradeoff between any constraint on the item selection and the 
precision of measurement. Any operational use of this algorithm would therefore require a 
detailed analysis of the richness of the item pool and the complexity of the other constraints 
imposed on the item selection. 
Figure Captions 

Figure 1. Bivariate plot of the estimated values of the discrimination and difficulty 
parameters in the item pool. 

Figure 2. Bivariate plot of the estimated values of the δ and difficulty parameters in the item 
pool. 

Figure 3. Frequency diagram of the estimated values of the examinee slowness parameter 
with the best fitting normal density curve. 

Figure 4. Double probability plot of the four fitted response-time distributions for a typical 
item. 

Figure 5. RMSE of the items for the four response-time distributions. 

Figure 6. Estimated standard deviations of the marginal log response-time distributions for 
the items as a function of the size of the examinee sample. 

Figure 7. Estimated bias and RMSE of the θ estimator as a function of θ and τ. 

Figure 8. Estimated bias and RMSE of the τ estimator as a function of τ and θ. 

Figure 9. Average time left after completion of the test for CAT without (9a) and with 
response-time constraints on item selection (9b: t_tot = 39; 9c: t_tot = 34; 9d: t_tot = 29 
mins.). 